ABSTRACT

Gathering labeled data in educational data mining (EDM) 
is a time- and cost-intensive task. However, the amount
of available training data directly influences the quality of 
predictive models. Unlabeled data, on the other hand, is 
readily available in high volumes from intelligent tutoring 
systems and massive open online courses. In this paper, we 
present a semi-supervised classification pipeline that makes 
effective use of this unlabeled data to significantly improve 
model quality. We employ deep variational auto-encoders 
to learn efficient feature embeddings that improve the per- 
formance for standard classifiers by up to 28% compared 
to completely supervised training. Further, we demonstrate 
on two independent data sets that our method outperforms 
previous methods for finding efficient feature embeddings 
and generalizes better to imbalanced data sets compared 
to expert features. Our method is data independent and 
classifier-agnostic, and hence provides the ability to improve 
performance on a variety of classification tasks in EDM. 
1. INTRODUCTION


Building predictive models of student characteristics such 
as knowledge level, learning disabilities, personality traits 
or engagement is one of the big challenges in educational 
data mining (EDM). Such detailed student profiles allow
for a better adaptation of the curriculum to individual
needs and are crucial for fostering optimal learning progress.
In order to build such predictive models, smaller-scale,
controlled user studies are typically conducted in which
detailed information about student characteristics is at hand
(labeled data). The quality of the predictive models, however,
inherently depends on the number of study participants,
which is typically low due to time and budget constraints.
In contrast to such controlled user studies, digital learning
environments such as intelligent tutoring systems (ITS),
educational games, learning simulations, and massive open
online courses (MOOCs) produce high volumes of data. These
data sets provide rich information about student interactions
with the system, but come with little or no additional
information about the user (unlabeled data).


Semi-supervised learning bridges this gap by making use of
patterns in larger unlabeled data sets to improve predictions
on smaller labeled data sets, and it is the focus of this
paper. These techniques are well explored in a variety of
domains, and it has been shown that classifier performance
can be improved for, e.g., image classification [15], natural
language processing [28] or acoustic modeling [21]. In
the education community, semi-supervised classification has
been applied using self-training, multi-view training and
problem-specific algorithms. Self-training has, for example,
been applied to problem-solving performance [22]. In
self-training, a classifier is first trained on labeled data and
is then iteratively retrained using its most confident
predictions on unlabeled data. Self-training has the disadvantage
that incorrect predictions decrease the quality of the
classifier. Multi-view training uses different data views and has
been explored with co-training [27] and tri-training [18] for
predicting prerequisite rules and student performance,
respectively. The performance of these methods, however, largely
depends on the properties of the different data views, which
are not yet fully understood [34]. Problem-specific
semi-supervised algorithms have been used to organize learning
resources on the web [19], with the disadvantage that they
cannot be directly applied to other classification tasks.


Recently, it has been shown (outside of the education context)
that variational auto-encoders (VAE) have the potential
to outperform the commonly used semi-supervised
classification techniques. A VAE is a neural network that includes
an encoder that transforms a given input into a typically
lower-dimensional representation, and a decoder that
reconstructs the input based on the latent representation. Hence,
VAEs learn an efficient feature embedding (feature
representation) using unlabeled data that can be used to
improve the performance of any standard supervised learning
algorithm [15]. This property greatly reduces the need for
problem-specific algorithms. Moreover, VAEs have the
advantage that the trained deep generative models are able
to produce realistic samples that allow for accurate data
imputation and simulations [23], which makes them an
appealing choice for EDM. Inspired by these advantages, and
by the demonstrated superior classifier performance in other
domains such as computer vision [16, 23], this paper explores
VAEs for student classification in the educational context.


We present a complete semi-supervised classification pipeline
that employs deep VAEs to extract efficient feature embed- 
dings from unlabeled student data. We have optimized the 
architecture of two different networks for educational data - 
a simple variational auto-encoder and a convolutional varia- 
tional auto-encoder. While our method is generic and hence 
widely applicable, we apply the pipeline to the problem of 
detecting students suffering from developmental dyscalculia 
(DD), which is a learning disability in arithmetic. The large
and unlabeled data set at hand consists of student data of 
more than 7K students and we evaluate the performance of 
our pipeline on two independent small and labeled data sets 
with 83 and 155 students. Our evaluation first compares the 
performance of the two networks, where our results indicate 
superiority of the convolutional VAE. We then apply dif- 
ferent classifiers to both labeled data sets, and demonstrate 
not only improvements in classification performance of up to 
28% compared to other feature extraction algorithms, but 
also improved robustness to class imbalance when using our 
pipeline compared to other feature embeddings. The im- 
proved robustness of our VAE is especially important for 
predicting relatively rare student conditions - a challenge 
that is often met in EDM applications. 


2. BACKGROUND 


In the semi-supervised classification setting we have access
to a large data set $\mathcal{X}_B$ without labels and a much
smaller labeled data set $\mathcal{X}_S$ with labels
$\mathcal{Y}_S$. The idea behind semi-supervised classification
is to make use of patterns in the unlabeled data set to improve
the quality of the classifier beyond what would be possible
with the small data set $\mathcal{X}_S$ alone. There are many
different approaches to semi-supervised classification including
transductive SVMs, graph-based methods, self-training or
representation learning [35]. In this work we focus on learning
an efficient encoding $z = E(x)$ for $x \in \mathcal{X}_B$ of
the data domain using the unlabeled data $\mathcal{X}_B$ only.
This learnt data transformation $E(\cdot)$ - the encoding - is
then applied to the labeled data set $\mathcal{X}_S$.
Well-known encoders include principal component analysis (PCA)
or Kernel PCA (KPCA). PCA is a dimensionality reduction
method that finds the optimal linear transformation from
an N-dimensional to a K-dimensional space (given a
mean-squared error loss). Kernel PCA [24] extends PCA by allowing
non-linear transformations into a K-dimensional space and
has, among others, been successfully used for novelty
detection in non-linear domains [11]. Recently, variational
auto-encoders (VAE) have outperformed other semi-supervised
classification techniques on several data sets [15]. VAEs
combine variational inference networks with generative models
parametrized by deep neural networks that exploit information
in the data density to find efficient lower dimensional
representations (feature embeddings) of the data.


Auto-encoder. An auto-encoder or autoassociator [2] is a
neural network that encodes a given input into a (typically
lower dimensional) representation such that the original
input can be reconstructed approximately. The auto-encoder
consists of two parts. The encoder part of the network takes
the N-dimensional input $x \in \mathbb{R}^N$ and computes an
encoding $z = E(x)$ while the decoder $D$ reconstructs the input
based on the latent representation, $\hat{x} = D(z)$. If we train
a network using the mean squared error loss and the network
consists of a single linear hidden layer of size K, e.g.
$E(x) = W_1 x + b_1$ and $D(z) = W_2 z + b_2$ for weights
$W_1 \in \mathbb{R}^{K \times N}$ and $W_2 \in \mathbb{R}^{N \times K}$
and offsets $b_1 \in \mathbb{R}^K$ and $b_2 \in \mathbb{R}^N$,
the auto-encoder behaves similarly to PCA in that
the network learns to project the input into the span of
the first K principal components [2]. For more complex
networks with non-linear layers, multi-modal aspects of the data
can be learnt. Auto-encoders can be used in semi-supervised
classification tasks because the encoder can compute a
feature representation $z$ of the original data $x$. These features
can then be used to train a classifier. The learnt feature
embedding facilitates classification by clustering related
observations in the computed latent space.


Variational auto-encoder. Variational auto-encoders [15]
are generative models that combine Bayesian inference with
deep neural networks. They model the input data $x$ as

$$p_\theta(x|z) = f(x; z, \theta), \qquad p(z) = \mathcal{N}(z|0, I), \qquad (1)$$

where $f$ is a likelihood function that performs a non-linear
transformation of $z$ with parameters $\theta$ by employing a deep
neural network. In this model the exact computation of
the posterior $p_\theta(z|x)$ is not computationally tractable.
Instead, the true posterior is approximated by a distribution
$q_\phi(z|x)$ [16]. This inference network $q_\phi(z|x)$ is
parametrized as a multivariate normal distribution

$$q_\phi(z|x) = \mathcal{N}(z|\mu_\phi(x), \mathrm{diag}(\sigma^2_\phi(x))), \qquad (2)$$

where $\mu_\phi(x)$ and $\sigma^2_\phi(x)$ denote vectors of means
and variances, respectively. Both functions $\mu_\phi(\cdot)$ and
$\sigma^2_\phi(\cdot)$ are represented as deep neural networks.
Hence, variational auto-encoders essentially replace the
deterministic encoder $E(x)$ and decoder $D(z)$ by a probabilistic
encoder $q_\phi(z|x)$ and decoder $p_\theta(x|z)$. Direct
maximization of the likelihood is computationally not tractable,
therefore a lower bound on the likelihood has been derived [16].
The learning task then amounts to maximizing this variational
lower bound

$$\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \mathrm{KL}[q_\phi(z|x)\,\|\,p(z)], \qquad (3)$$

where KL denotes the Kullback-Leibler divergence. The
lower bound consists of two intuitive terms. The first term
is the reconstruction quality while the second one
regularizes the latent space towards the prior $p(z)$. We perform
optimization of this lower bound by applying a stochastic
optimization method using gradient back-propagation [14].
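

To make the two terms of the lower bound concrete, the following is a minimal sketch of a single-sample estimate of Equation (3). It assumes a Gaussian likelihood with diagonal covariance for the decoder; the function names and the list-based representation are illustrative assumptions and not part of the described implementation.

```python
import math


def gaussian_log_likelihood(x, mean, var):
    """log N(x | mean, diag(var)) for equal-length lists of floats."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def kl_to_standard_normal(mu, var):
    """KL[N(mu, diag(var)) || N(0, I)], summed over all latent dimensions."""
    return sum(0.5 * (v + m * m - 1.0 - math.log(v)) for m, v in zip(mu, var))


def elbo_estimate(x, dec_mean, dec_var, enc_mu, enc_var):
    """Reconstruction term minus KL regularizer for one sampled z."""
    reconstruction = gaussian_log_likelihood(x, dec_mean, dec_var)
    return reconstruction - kl_to_standard_normal(enc_mu, enc_var)
```

The KL term vanishes exactly when the encoder output equals the prior (zero mean, unit variance), which is why it acts as a regularizer on the latent space.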


3. METHOD 


In the following we introduce two networks. First, a simple
variational auto-encoder consisting of fully connected
layers to learn feature embeddings of student data. These
encoders have been shown to be powerful for semi-supervised
classification [15], and are often applied due to their
simplicity. Second, an advanced auto-encoder that combines the
advantages of VAEs with the superiority of asymmetric encoders.
This is motivated by the fact that asymmetric auto-encoders
have shown superior performance and more meaningful
feature representations compared to simple VAEs in other
domains such as image synthesis [29].


Figure 1: Network layouts for our simple student auto-encoder (left) using only fully connected layers and our
improved CNN student auto-encoder (right) using convolutions for the encoder and recurrent LSTM layers
for the decoder. In contrast to standard auto-encoders, the connections to the latent space z are sampled
(red dashed arrows) from a Gaussian distribution.


Student snapshots. There are many applications where
we want to predict a label $y_n$ for each student $n$ within an
ITS based on behavioral data $X_n$. These labels typically
relate to external variables or properties of a student, such
as age, learning disabilities, personality traits, learner types,
learning outcome etc. Similar to Knowledge Tracing (KT)
we propose to model the data $X_n = \{x_{n1}, \ldots, x_{nT}\}$ as a
sequence of T observations. In contrast to KT we store F
different feature values $x_{nt} \in \mathbb{R}^F$ for each element in the
sequence, where t denotes the t-th opportunity within a task.
This allows us to simultaneously store data from multiple
tasks in $x_{nt}$, e.g. $x_{n1}$ stores all features for student n that
were observed during the first task opportunities. For
every task in an ITS we can extract various different features
that characterize how a student n was approaching the task.
These features include performance, answer times, problem
solving strategies, etc. We combine this information into a
student snapshot $X_n \in \mathbb{R}^{T \times F}$, where T is the number of task
opportunities and F is the number of extracted features.
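

As an illustration of the snapshot representation, the sketch below stacks T per-opportunity feature lists into a T x F structure and flattens it into a single vector. The helper names, the example values and the row-by-row flattening order are assumptions made for this example; any fixed ordering would serve the same purpose for a fully connected network.

```python
def make_snapshot(observations, num_features):
    """Stack T per-opportunity feature lists into a T x F snapshot."""
    for row in observations:
        if len(row) != num_features:
            raise ValueError("every opportunity needs exactly F feature values")
    return [list(row) for row in observations]


def vec(snapshot):
    """Flatten a T x F snapshot into one vector (row by row; an assumed ordering)."""
    return [value for row in snapshot for value in row]


# Hypothetical student with T = 3 opportunities and F = 2 features
# (for instance an answer time and a response time, in seconds).
X_n = make_snapshot([[4.2, 1.1], [3.8, 0.9], [5.0, 1.4]], num_features=2)
x = vec(X_n)
```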


Simple student auto-encoder (S-SAE). Our simple
variational auto-encoder follows the general design outlined
in Section 2 and is based on the student snapshot
representation. For ease of notation we use $x := \mathrm{vec}(X_n)$, where
$\mathrm{vec}(\cdot)$ is the matrix vectorization function, to represent the
student snapshot of student n. The complete network
layout is depicted in Figure 1, left. The encoder and decoder
networks consist of L fully connected layers that are
implemented as an affine transformation of the input followed by
a non-linear activation function $\beta(\cdot)$ as
$x_l = \beta(W_l x_{l-1} + b_l)$, where $l$ is the layer index and
$W_l$ and $b_l$ are a weight matrix and offset vector of suitable
dimensions. Typical choices for $\beta(\cdot)$ include tanh,
rectified linear units or sigmoid functions [6]. To produce
latent samples $z$ we sample from the normal
distribution (see Equation (2)) using re-parametrization [16]

$$z = \mu_\phi(x) + \sigma_\phi(x)\,\epsilon, \qquad (4)$$

where $\epsilon \sim \mathcal{N}(0, 1)$, to allow for back-propagation of
gradients. For $p_\theta(x|z)$ (see (1)) any suitable likelihood
function can be used. We used a Gaussian distribution for all
presented examples. Note that the likelihood function is
parametrized by the entire (non-linear) decoder network.
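

The re-parametrization in Equation (4) can be sketched in a few lines. This is an illustrative example only: it assumes the encoder outputs a mean and a log-variance per latent dimension, and the function name and the use of Python's standard random generator are choices made for this sketch.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1) per dimension."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


rng = random.Random(0)
# A log-variance of 0 corresponds to unit variance in each dimension.
z = sample_latent(mu=[0.5, -1.0], log_var=[0.0, 0.0], rng=rng)
```

Because the randomness enters only through eps, the sample is a deterministic function of the mean and variance, so gradients can flow back to the encoder parameters.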


The training of variational auto-encoders can be challenging
as stochastic optimization was found to set $q_\phi(z|x) = p(z)$
in all but vanishingly rare cases [3], which corresponds to a
local maximum that does not use any information from $x$.
We therefore add a warm-up phase that gradually gives the
regularization term in the target function more weight:

$$\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \alpha\,\mathrm{KL}[q_\phi(z|x)\,\|\,p(z)], \qquad (5)$$

where $\alpha \in [0,1]$ is linearly increased with the number of
epochs. The warm-up phase has been successfully used
for training deep variational auto-encoders [25].
Furthermore, we initialize the weights of the dense layer computing
$\log(\sigma^2_\phi(x))$ to 0 (yielding a variance of 1 at the beginning of
the training). This was motivated by our observation that if
we employ standard random weight initialization techniques
(glorot-norm, he-norm [9]) we can get relatively high initial
estimates for the variance $\sigma^2_\phi(x)$, which due to the sampling
leads to very unreliable samples $z$ in the latent space. The
large variance in sampled points in the latent space leads to
bad convergence properties of the network.
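

A linear warm-up schedule of the kind described above could look as follows. This is a minimal sketch: the function names are hypothetical, and it assumes that the weight reaches 1 at the end of the warm-up phase and stays there afterwards.

```python
def kl_weight(epoch, warmup_epochs):
    """Linearly increase the KL weight from 0 to 1 over the warm-up phase."""
    if warmup_epochs <= 0:
        return 1.0
    return min(1.0, epoch / warmup_epochs)


def warmup_objective(reconstruction_term, kl_term, epoch, warmup_epochs):
    """Lower bound with the KL term scaled by the current warm-up weight."""
    return reconstruction_term - kl_weight(epoch, warmup_epochs) * kl_term
```

With a warm-up of 300 epochs, for example, the KL term would receive half its final weight at epoch 150 and full weight from epoch 300 onwards.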


CNN student auto-encoder (CNN-SAE). Following 
the recent findings in computer vision we present a second, 
more advanced network that typically outperforms simpler 
VAE. In [29], for example, these asymmetric auto-encoders 
resulted in superior reconstruction of images as well as more 
meaningful feature embeddings. A specific kind of convolu- 
tional neural network was combined with an auto-encoder, 
being able to directly capture low level pixel statistics and 
hence to extract more high-level feature embeddings. 


Inspired by this previous work, we combine an asymmetric 
auto-encoder (and a decoder that is able to capture low level 
statistics) with the advantages of variational auto-encoders. 
Figure 1, right, shows our combined network. We employ
multiple layers of one-dimensional convolutions to parametrize
the encoder $q_\phi(z|x)$ (again we assume a Gaussian
distribution, see (2)). The distribution is parametrized as follows:

$$\mu_\phi(x) = W_\mu h + b_\mu$$
$$\log(\sigma^2_\phi(x)) = W_\sigma h + b_\sigma$$
$$h = \mathrm{conv}_l(x) = \beta(W_l * \mathrm{conv}_{l-1}(x)),$$

where $*$ is the convolution operator, $W_l$, $W_\mu$, $W_\sigma$ are weights
of suitable dimensions, $\beta(\cdot)$ is a non-linear activation
function and $l$ denotes the layer depth. Further, $\mathrm{conv}_0(x) = x$.
We keep the standard variational layer (see (4)) while
changing the output layer to a recurrent layer using long
short-term memory (LSTM) units. Recurrent layers have
successfully been used in auto-encoders before, e.g. in [5]. LSTMs
have been very successful for modeling temporal sequences
because they can model long and short term dependencies
between time steps. Every LSTM unit receives a copy of the
sampled points in latent space, which allows the LSTM
network to combine context information (point in the latent
space) with the sequence information (memory unit in the
LSTM cell). Using LSTM cells the decoder $p_\theta(x|z)$ assumes
a Gaussian distribution and is parametrized as follows:

$$\mu_{\theta t}(z) = W_{\mu z} \cdot \mathrm{lstm}_t(z) + b_{\mu z}$$
$$\log(\sigma^2_{\theta t}(z)) = W_{\sigma z} \cdot \mathrm{lstm}_t(z) + b_{\sigma z},$$

where $\mu_{\theta t}(z)$ and $\sigma^2_{\theta t}(z)$ are the t-th components of
$\mu_\theta(z)$ and $\sigma^2_\theta(z)$, respectively, $\mathrm{lstm}_t(\cdot)$ denotes
the t-th LSTM cell and $W_{\cdot z}$ and $b_{\cdot z}$ denote suitable
weight and offset parameters.


Figure 2: Pipeline overview. We train the variational auto-encoder on a large unlabeled data set. The trained
encoder of the auto-encoder can be used to transform other data sets into an expressive feature embedding.
Based on this feature embedding we train different classifiers to predict the student labels.
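

As a rough illustration of the convolutional encoder recursion, where each layer applies a convolution followed by a non-linearity and the first layer receives the raw input, the sketch below stacks single-channel one-dimensional convolutions. It is a simplification made for this example: the kernel values are hypothetical, only one channel is used, and the convolution is computed without flipping the kernel, as is common in deep learning libraries.

```python
def relu(values):
    """Rectified linear activation applied element-wise."""
    return [max(0.0, v) for v in values]


def conv1d(signal, kernel):
    """'Valid' 1-D convolution in the deep learning sense (no kernel flip)."""
    width = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(width))
        for i in range(len(signal) - width + 1)
    ]


def conv_stack(x, kernels):
    """Apply one convolution plus activation per kernel, starting from x itself."""
    h = list(x)
    for kernel in kernels:
        h = relu(conv1d(h, kernel))
    return h


# Hypothetical input sequence and two hypothetical kernels.
h = conv_stack([1.0, 2.0, 3.0, 4.0, 5.0], kernels=[[-1.0, 0.0, 1.0], [0.5, 0.5]])
```

The resulting vector h would then be fed into two dense layers that output the mean and the log-variance of the latent distribution.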


Feature selection. VAEs provide a natural way of
performing feature selection. The inference network $q_\phi(z|x)$
infers the mean and variance for every dimension $z_i$.
Therefore, the most informative dimension $z_i$ has the highest KL
divergence from the prior distribution $p(z_i) = \mathcal{N}(0,1)$ while
uninformative dimensions will have a KL divergence close to
0 [10]. The KL divergence of $z_i$ to $p(z_i)$ is given by

$$\mathrm{KL}[q_\phi(z_i|x)\,\|\,p(z_i)] = -\log(\sigma_i) + \frac{\sigma_i^2 + \mu_i^2}{2} - \frac{1}{2}, \qquad (6)$$

where $\mu_i$ and $\sigma_i$ are the inferred parameters of the Gaussian
distribution $q_\phi(z_i|x)$. Feature selection proceeds by keeping
the K dimensions $z_i$ with the largest KL divergence.
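

The selection step can be sketched as follows. The per-dimension divergence follows the closed form for a univariate Gaussian against a standard normal prior; averaging the divergence over all encoded samples before ranking is an assumption of this sketch, since the aggregation across samples is not specified above, and the function names are illustrative.

```python
import math


def kl_per_dimension(mu, sigma):
    """KL[N(mu_i, sigma_i^2) || N(0, 1)] for each latent dimension i."""
    return [
        -math.log(s) + 0.5 * (s * s + m * m) - 0.5
        for m, s in zip(mu, sigma)
    ]


def select_dimensions(mus, sigmas, k):
    """Indices of the k latent dimensions with the largest mean KL divergence."""
    num_dims = len(mus[0])
    totals = [0.0] * num_dims
    for mu, sigma in zip(mus, sigmas):
        for i, kl in enumerate(kl_per_dimension(mu, sigma)):
            totals[i] += kl
    means = [t / len(mus) for t in totals]
    return sorted(range(num_dims), key=lambda i: means[i], reverse=True)[:k]
```

A dimension whose inferred distribution always equals the prior (mean 0, standard deviation 1) has zero divergence and is therefore ranked last.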


Semi-supervised classification pipeline. The encoder 
and the decoder of the variational auto-encoder can be used 
independently of each other. This independence allows us 
to take the trained encoder and map new data to the learnt 
feature embedding. Figure 2 provides an overview of the 
entire pipeline for semi-supervised classification. In a first 
unsupervised step we train a VAE on unlabeled data. The 
learnt encoder $q_\phi(z|x)$ is then used to transform labeled data
sets to the feature embedding. We finally apply our feature 
selection step that considers the relative importance of the 
latent dimensions as previously described. We then train 
standard classifiers (Logistic Regression, Naive Bayes and 
Support Vector Machine) on the feature embeddings. 


4. RESULTS 


We evaluated our approach for the specific example of de- 
tecting developmental dyscalculia (DD), which is a learning 
disability affecting the acquisition of arithmetic skills [33]. 
Based on the learnt feature embedding on a large unlabeled 
data set the classifier performance was measured on two in- 
dependent, small and labeled data sets from controlled user 
studies. We refer to them as balanced and imbalanced data 
sets since their distribution of DD and non-DD children differs:
the first study has approximately 50% DD, while the
second one includes 5% DD (typical prevalence of DD). 


4.1 Experimental Setup 

All three data sets were collected from Calcularis, which is 
an intelligent tutoring system (ITS) targeted at elementary 
school children suffering from DD or exhibiting difficulties 
in learning mathematics [13]. Calcularis consists of different 
games for training number representations and calculation. 
Previous work identified a set of games that are predictive 
of DD within Calcularis [17]. Since timing features were 
found to be one of the most relevant indicators for detecting 
DD [4] and to facilitate comparison to other feature embed- 
ding techniques we limited our analysis to log-normalized 
timing features, for which we can assume normal distribu- 
tion [30]. Therefore, we evaluated our pipeline on the sub- 
set of games from [17] for which meaningful timing features 
could be extracted and sufficient samples were available in all 
data sets (we used >7000 samples for training the VAEs). 
Since our pipeline currently does not handle missing data,
only students with complete data were included.


Timing features were extracted for the first 5 tasks in 5 dif- 
ferent games. The selected games involve addition tasks 
(adding a 2-digit number to a 1-digit number with ten-crossing;
adding two 2-digit numbers with ten-crossing), number
conversion (spoken to written numbers in the ranges 0-10
and 0-100) and subtraction tasks (subtracting a 1-digit
number from a 2-digit number with ten-crossing). For every
task we extracted the total answer time (time between the 
task prompt until the answer was entered) and the response 
time (time between the task prompt and the first input by 
the student). Hence, each student is represented by a 50- 
dimensional snapshot x (see Section 3). 
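

One plausible reading of the log-normalization of timing features is sketched below: times are log-transformed and then standardized to zero mean and unit variance across students. The exact normalization is not spelled out above, so both the standardization step and the function name are assumptions of this example.

```python
import math


def log_normalize(times):
    """Log-transform positive times, then standardize to zero mean, unit variance."""
    logs = [math.log(t) for t in times]
    mean = sum(logs) / len(logs)
    var = sum((v - mean) ** 2 for v in logs) / len(logs)
    std = math.sqrt(var) if var > 0 else 1.0
    return [(v - mean) / std for v in logs]


# Hypothetical answer times (in seconds) of four students on one task.
normalized = log_normalize([2.0, 4.0, 8.0, 16.0])
```

The log transform turns multiplicative differences between response times into additive ones, which is what makes a normal assumption on the transformed values reasonable.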


Unlabeled data set. The unlabeled data set was extracted 
using live interaction logs from the ITS Calcularis. In total, 
we collected data from 7229 children. Note that we have 
no additional information about the children such as DD or 
grade. We excluded all teacher accounts as well as log files 
that were < 20KB. Since every new game in Calcularis is 
introduced by a short video during the very first task, we 
excluded this particular task for all games. 


Balanced data set. The first labeled data set is based 
on log files from 83 participants of a multi-center user study 
conducted in Germany and Switzerland, where approximately
half of the participants were diagnosed with DD (47 DD, 36
control) [31]. During the study, children trained with
Calcularis at home five times per week for six weeks and
solved on average 1551 tasks. There were 28 participants
in 2nd grade (9 DD, 19 control), 40 children in 3rd grade
(23 DD, 17 control), 12 children in 4th grade (12 DD) and
3 children in 5th grade (3 DD). The diagnosis of DD was
based on standardized neuropsychological tests [31].


Imbalanced data set. The second labeled data set is from 
a user study conducted in the classroom of ten Swiss elemen- 
tary school classes. In total, 155 children participated, and 
a prevalence of DD of 5% could be detected (8 DD, 147 control).
There were 97 children in 2nd grade (3 DD, 94 control)
and 58 children in 3rd grade (5 DD, 53 control). The DD diagnosis
was computed based on standardized tests assessing
the mathematical abilities of the children [32, 7]. During the 
study, children solved 85 tasks directly in the classroom. On 
average, children needed 26 minutes to complete the tasks. 


Implementation. The unlabeled data set was used to train 
the unsupervised VAE for extracting compact feature em- 
beddings of the data. Based on the learnt data transforma- 
tions we evaluated two standard classifiers: Logistic Regres- 
sion (LR) and Naive Bayes (NB). We restricted our evalu- 
ation to simple classification models because we wanted to 
assess the quality of the feature embedding and not the qual- 
ity of the classifier. More advanced classifiers typically per- 
form a (sometimes implicit) feature transformation as part 
of their data fitting procedure. To represent at least one 
model that performs such an embedding we included Support
Vector Machine (SVM) in all our results. All classifier
parameters were chosen according to the default values in
scikit-learn. Note that we have additionally performed
randomized cross-validated hyper-parameter search for all
classifiers, which, however, resulted in marginal improvements
only. Because of that, and to keep the model simple and
especially easily reproducible, we use the default parameter set
in this work. For Logistic Regression we used L2
regularization with C = 1, for Naive Bayes we used Gaussian
distributions, and for the SVM we used RBF kernels with data
point weights set inversely proportional to label frequencies. All
results are cross-validated using 30 randomized training-test
splits on the unlabeled data (test size 5%). The classification
part of the pipeline is additionally cross-validated using 300
label-stratified random training-test splits (test size 20%) to
ensure highly reproducible classification results.
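

Weights that are inversely proportional to label frequencies can be computed as in the sketch below. The particular normalization, n_samples / (n_classes * class_count), is an assumption of this example; it corresponds to the "balanced" class weighting heuristic offered by scikit-learn. The function name is illustrative.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


# Label counts of the imbalanced data set: 8 DD (1) and 147 control (0).
labels = [1] * 8 + [0] * 147
weights = inverse_frequency_weights(labels)
```

With these counts, each DD sample is weighted 147/8 (roughly 18) times more heavily than a control sample.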


Network hyper-parameters were defined using the approach 
described in [1]. We increased the number of nodes per 
layer, the number of layers and the number of epochs until 
a good fit of the data was achieved. We then regularized 
the network using dropout [26] with increasing dropout rate 
until the network was no longer overfitting the data.
Activation and weight initialization have been chosen according
to common standards: we employ the most common
activation function, namely rectified linear units
(ReLU) [20], for all activations. Weight initialization was
performed using the method by He et al. [9]. Following this
procedure, the following parameters were used for the
S-SAE model: encoder and decoder used 3 layers of size 320.
The CNN-SAE model was parametrized as follows: 3
convolution layers with 64 convolution kernels and a filter length
of 3. We used a single layer of LSTM cells with 80 nodes.
We used a batch size of 500 samples and batch normalization
and dropout (r = 0.25) at every layer. The warm-up
phase (see Section 3) was set to 300 epochs. Training was
stopped after 1000 (S-SAE) and 500 (CNN-SAE) epochs.
The number of latent units was set to 8 in accordance with
previous work on detecting students with DD that used 17
features but found that about half of the features were
sufficient to detect DD with high accuracy [17]. When feature
selection was applied we set the number of features to K = 4
and thus we kept exactly half of the latent space features.
All networks were implemented using the Keras framework
with TensorFlow and optimized using Adam stochastic
optimization with standard parameters according to [14].


4.2 Performance comparison 

Our VAE models are trained to extract efficient feature em- 
beddings of the data. To assess the quality of these com- 
puted feature representations, we compare the classification 
performance of our method to previous techniques for find- 
ing efficient feature embeddings, as well as to feature sets 
optimized specifically for the task of predicting DD. 


Network comparison. In a first experiment we compared 
the feature embeddings generated by our simple S-SAE and 
our asymmetric CNN-SAE with and without feature selec- 
tion. Figure 3 illustrates the average ROC curves of our 
complete semi-supervised classification pipeline. Our fea- 
ture embeddings based on asymmetric CNN-SAE clearly 
outperform the ones from the simple S-SAE on both the 
imbalanced and the balanced data set for Naive Bayes (NB) 
and Logistic Regression (LR). For both models, feature se- 
lection improves the area under the ROC curve (AUC) for 
the imbalanced data set (CNN-SAE: LR 4.2%, NB 6.3%;
S-SAE: LR 6.8%, NB 1.6%), but has no effect for the balanced
data set. We believe that this is due to the ability of
the classifiers to distinguish useful features from noisy ones 
given enough samples. Since the performance of the clas- 
sifiers with feature selection (FS) is better or equal to no 
feature selection in each experiment, we used the CNN-SAE 
FS model for all further evaluations. 


Classification performance. In Figure 4 we compare the 
classifier performance for different feature embeddings. We 
compare our method based on VAEs to two well-known methods
for finding optimal feature embeddings, namely principal
component analysis (PCA, green) and Kernel PCA (KPCA,
red) [24]. For comparison and as a baseline for the performance
of the different methods, we include direct classification
results (gray), for which no feature embedding was
computed. We used K = 8 (dimensionality of the feature
embedding) for all methods. The features extracted by our
pipeline compare favorably to PCA and Kernel PCA show- 
ing improvements in terms of AUC of 28% for Logistic Re- 
gression and 23% for Naive Bayes on the imbalanced data 
set and an improvement of 3.75% for Logistic Regression 
and 7.5% for Naive Bayes on the balanced data set. By 
using simple classifiers, we demonstrated that our encoder 
learns an effective feature embedding. More sophisticated 
classifiers (such as SVM with non-linear kernels) typically 
proceed by first embedding the input into a specific feature 
space that is different from the original space. 


Figure 3: ROC curves for the two proposed models
with and without feature selection (FS). Our
asymmetric CNN-SAE outperforms the simple S-SAE
consistently with (blue) and without (purple)
feature selection. Feature selection improves performance
only on the imbalanced data set.


For the imbalanced data set the overall performance for 
SVM is significantly lower for all embeddings. This is in line 
with previous work [12] showing that for imbalanced data 
sets, the decision boundaries of SVMs are heavily skewed 
towards the minority class resulting in a preference for the 
majority class and thus a high misclassification rate for the
minority class. Indeed, we found that SVM predicted only 
majority labels on the imbalanced data set. For the bal- 
anced data set our feature embedding shows improvements 
of 2.5% over alternative embeddings when using SVM. 
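

For reference, the AUC used throughout this comparison can be computed directly from its probabilistic definition: the chance that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below is a plain pairwise implementation with a hypothetical function name; it is meant to illustrate the metric and is not the evaluation code used for the reported results.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A scorer that assigns the same score to every sample, and hence carries no ranking information, obtains an AUC of 0.5, which is the chance level.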


Further, Table 1 shows the performance of all feature embed- 
dings using three additional common classification metrics: 
root mean squared error (RMSE), classification accuracy 
(Acc.) and area under the precision recall curve (AUPR). 
We statistically compared the classification metrics of our 
feature embedding to the best alternative feature embed- 
ding using an independent t-test and Bonferroni correction 
for multiple tests ($\alpha = 0.05$). Our feature embedding significantly
outperformed alternative embeddings for all classifiers
on both the balanced and imbalanced data sets on most
metrics. The main exception was the performance of SVM 
on the imbalanced data set, which exhibited large variance 
for all feature embeddings and the worst overall classifica- 
tion performance (compared to the other classifiers). 


When comparing classification performance on the imbalanced
and the balanced data sets we observed that our
pipeline using VAEs showed significant performance improvements
compared to other methods for finding feature embeddings.
While the unlabeled and the balanced data sets stem
from an adaptive version of Calcularis, the imbalanced data
was collected using a fixed task sequence. As our method
shows larger improvements on the imbalanced data, we believe
CNN-SAE learned an embedding that is robust beyond
adaptive ITS. The relative improvement of our feature
embeddings is smallest for SVM on the balanced data set. We
believe that this is due to the ability of the SVM to learn
complex decision boundaries given sufficient data. However, the
ability to learn complex decision boundaries renders SVMs more
vulnerable to class imbalance, yielding performance at
random level on the imbalanced data set.


Comparison to specialized models. Recently, a specialized 
Naive Bayes classifier (S-NB) for the detection of 
developmental dyscalculia (DD) was introduced, together with 
a set of features optimized for the detection of DD [17]. 
The development of S-NB including the set of features was 
based on the balanced data set used in this work. In com- 
parison to S-NB, our approach relies on timing data only 
and the extracted features are independent of the classifi- 
cation task. We compared the performance of S-NB to our 
CNN-SAE model on both data sets. For the balanced data 
set we found an AUC of 0.94 for the specialized model (S- 
NB) compared to an AUC of 0.86 for Naive Bayes using our 
feature embedding. On the imbalanced data set we found 
an AUC of 0.67 for S-NB compared to an AUC of 0.77 us- 
ing Logistic Regression with our feature embedding. These 
findings demonstrate that while our feature embedding per- 
forms slightly worse on the balanced data set (for which the 
S-NB was developed), we significantly outperform S-NB by 
15% on the imbalanced data set, which suggests that our 
VAE model automatically extracts feature embeddings that 
are more robust than expert features. 
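
The reported percentages are consistent with relative (not absolute) AUC improvements. The sketch below works under that assumption. The function name `relative_improvement` is illustrative, and the AUC values are the ones quoted in the text and in Table 1.

```python
def relative_improvement(baseline, ours):
    """Relative gain of `ours` over `baseline`, as a fraction of baseline."""
    return (ours - baseline) / baseline


# S-NB vs. Logistic Regression with our embedding (imbalanced data set).
print(f"{relative_improvement(0.67, 0.77):.1%}")  # 14.9%
# Kernel PCA vs. CNN-SAE, Logistic Regression (imbalanced, Table 1).
print(f"{relative_improvement(0.61, 0.78):.1%}")  # 27.9%
# Direct vs. CNN-SAE, Naive Bayes (balanced, Table 1).
print(f"{relative_improvement(0.80, 0.86):.1%}")  # 7.5%
```

Read this way, an AUC gain of 0.10 over a baseline of 0.67 corresponds to roughly 15%, whereas the absolute difference is 10 percentage points.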


Robustness on sample size. Ideally, a classifier’s performance 
should decrease gracefully as less data is provided. 
A good feature embedding allows a classifier to generalize 
well based on few labeled examples because similar samples 
are clustered together in the feature embedding. We there- 
fore investigated the robustness of the different feature rep- 
resentations with respect to the training set size. For this we 
used the balanced data set where we varied the training set 
size between 7 (10% of the data) and 62 (90% of the data) 
by random label-stratified sub-sampling. Figure 5 compares 
the AUC of the different feature embeddings over different 
sizes of the training set. In the case of Naive Bayes and Logistic 
Regression, our embedding provides superior performance 
for all training set sizes. For large enough data sets, SVM 
using the raw feature data (Direct, grey) performs as 
well as SVM using our embedding (CNN-SAE, blue). However, 
for smaller data sets, starting at 30 samples, the performance 
of SVM based on the raw features declines more rapidly 
than that of the SVM based on our feature embedding. 
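
Random label-stratified sub-sampling can be sketched as follows. This is one possible implementation, not necessarily the procedure used in the experiments. The function `stratified_subsample`, the 69-sample label vector and its 35/34 class split are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, n_train, seed=0):
    """Return sorted indices of a random subsample of size n_train whose
    class proportions match those of `labels` as closely as possible."""
    if not 0 < n_train <= len(labels):
        raise ValueError("n_train must be between 1 and len(labels)")
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Proportional allocation per class, rounded down first.
    total = len(labels)
    quota = {c: n_train * len(ix) / total for c, ix in by_class.items()}
    counts = {c: int(q) for c, q in quota.items()}

    # Give the remaining slots to the classes with the largest remainders.
    remaining = n_train - sum(counts.values())
    by_remainder = sorted(quota, key=lambda c: quota[c] - counts[c], reverse=True)
    for c in by_remainder[:remaining]:
        counts[c] += 1

    chosen = []
    for c, ix in by_class.items():
        chosen.extend(rng.sample(ix, counts[c]))
    return sorted(chosen)


# Hypothetical balanced label vector with 69 samples.
labels = [0] * 35 + [1] * 34
for n_train in (7, 62):
    idx = stratified_subsample(labels, n_train, seed=1)
    n_pos = sum(labels[i] for i in idx)
    print(n_train, len(idx), n_pos)
```

Repeating the draw with different seeds for each training set size yields the individual points to which a curve such as the one in Figure 5 can be fitted.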


5. CONCLUSION 


We adapted the recently developed variational auto-encoders 
to educational data for the task of semi-supervised clas- 
sification of student characteristics. We presented a com- 
plete pipeline for semi-supervised classification that can be 
used with any standard classifier. We demonstrated that structures 
extracted from large-scale unlabeled data sets can 
significantly improve classification performance for different 
labeled data sets. Our findings show that the improvements 
are especially pronounced for small or imbalanced data sets. 
Imbalanced data sets typically arise in EDM when detecting 
relatively rare conditions such as learning disabilities. 


Figure 4: Classification performance for different feature embeddings. Our variational auto-encoder (blue) 
outperforms other embeddings by up to 28% (imbalanced data set) and by up to 7.5% (balanced data set). 

Figure 5: Comparison of classifier performance on the balanced data for different training set sizes (moving 
average fitted to data points). The features automatically extracted by our variational auto-encoder (blue) 
maintain a performance advantage even if the training size shrinks to 7 samples (10% of the original size). 


Table 1: Comparison of our method to alternative embeddings. Our approach using a variational auto-encoder 
(CNN-SAE) significantly outperforms other approaches for most cases. The best score for each metric and 
classifier is shown in bold. * = statistically significant difference (t-test with Bonferroni correction, α = 0.05). 

                     Direct                        PCA                           Kernel PCA                    CNN-SAE
                     AUC    RMSE   AUPR   Acc.     AUC    RMSE   AUPR   Acc.     AUC    RMSE   AUPR   Acc.     AUC    RMSE   AUPR   Acc.
Imbalanced data set
Logistic Regression  0.54   0.27   0.18   0.91     0.54   0.25   0.17   0.93     0.61   0.25   0.16   0.93     0.78*  0.24*  0.28*  0.94*
Naive Bayes          0.51   0.29   0.23   0.91     0.50   0.29   0.10   0.90     0.57   0.28   0.20   0.91     0.70*  0.25*  0.24   0.93*
SVM                  0.55   0.25   0.22*  0.94     0.40   0.25   0.08   0.94     0.42   0.25   0.09   0.93     0.59   0.25   0.16   0.94
Balanced data set
Logistic Regression  0.80   0.44   0.82   0.73     0.80   0.42   0.84   0.73     0.80   0.42   0.83   0.75     0.83*  0.40*  0.84   0.77
Naive Bayes          0.80   0.49   0.80   0.73     0.77   0.46   0.77   0.71     0.76   0.46   0.76   0.70     0.86*  0.38*  0.86*  0.80*
SVM                  0.81   0.42   0.84*  0.75     0.79   0.43   0.81   0.73     0.80   0.43   0.83   0.73     0.83   0.40*  0.81   0.79*


Improved classification results with simple classifiers such as 
Logistic Regression might indicate that VAEs learn feature 
embeddings that are interpretable by human experts. In 
the future we want to explore the learnt representations and 
compare them to traditional categorizations of students (skills, 
performance, etc.). Additionally, we want to extend our 
results to include additional feature types and data reliability 
indicators to handle missing data. Although we trained 
our networks on comparatively small sample sizes, the presented 
method scales (due to mini-batch learning) to much 
larger data sets (>100K users), allowing the training of more 
complex VAEs. Moreover, the generative model p_θ(x|z) that 
is part of any VAE can be used to produce realistic data 
samples [29]. Up-sampling of the minority class provides a 
potential way to improve the decision boundaries for classifiers. 
In contrast to common up-sampling methods such as 
ADASYN [8], VAE-based sampling does not require nearest 
neighbor computations, which makes it better suited 
to small data sets. Preliminary results for random subsets 
of the balanced data set showed that VAE-based up-sampling 
improved AUC by 2-3% compared to ADASYN. 
While we applied our method to the specific case of detecting 
developmental dyscalculia, the presented pipeline is generic 
and thus can be applied to any educational data set and 
used for the detection of any student characteristic. 
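
The VAE-based up-sampling mentioned above can be sketched as drawing latent codes near the encodings of minority-class samples and passing them through the decoder. The text does not specify how latent codes are chosen, so this is only one possible reading. The function names, the latent statistics and the fixed linear `toy_decoder` (a stand-in for a trained decoder network) are all hypothetical.

```python
import random


def sample_latent(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def toy_decoder(z):
    """Hypothetical stand-in for the mean of the generative model:
    a fixed linear map from a 2-d latent code to 3 features."""
    weights = [[1.0, 0.5], [-0.5, 1.0], [0.25, 0.25]]
    return [sum(w * zi for w, zi in zip(row, z)) for row in weights]


def upsample_minority(latent_stats, n_new, decoder, seed=0):
    """Generate n_new synthetic samples from the (mu, sigma) latent
    statistics of existing minority-class samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        mu, sigma = rng.choice(latent_stats)
        synthetic.append(decoder(sample_latent(mu, sigma, rng)))
    return synthetic


# Hypothetical encoder outputs (mean, std) for three minority samples.
latent_stats = [([0.2, -1.0], [0.1, 0.1]),
                ([0.4, -0.8], [0.2, 0.1]),
                ([0.1, -1.2], [0.1, 0.3])]
new_samples = upsample_minority(latent_stats, n_new=5, decoder=toy_decoder)
print(len(new_samples), len(new_samples[0]))
```

Unlike ADASYN, this sketch needs no nearest-neighbor search among the labeled samples. Each synthetic sample depends only on one encoded minority sample and the decoder.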